Many real-world applications of language models (LMs), such as code autocomplete and writing assistance, involve human-LM interaction, but the main LM benchmarks are non-interactive: a system produces output without human involvement. To evaluate human-LM interaction, we develop a framework, Human-AI Language-based Interaction Evaluation (H-LINE), that expands non-interactive evaluation along three dimensions, capturing (i) the interactive process, not only the final output; (ii) the first-person subjective experience, not just a third-party assessment; and (iii) notions of preference beyond quality. We then design five tasks ranging from goal-oriented to open-ended to capture different forms of interaction. On four state-of-the-art LMs (three variants of OpenAI's GPT-3 and AI21's J1-Jumbo), we find that better non-interactive performance does not always translate into better human-LM interaction and that first-person and third-party metrics can diverge, suggesting the importance of examining the nuances of human-LM interaction.
With the rise of AI in recent years and the increasing complexity of models, the growing demand for computational resources is starting to pose a significant challenge. The need for higher compute power is being met with increasingly potent accelerators and the use of large compute clusters. However, the gain in prediction accuracy from large models trained on distributed and accelerated systems comes at the price of a substantial increase in energy demand, and researchers have started questioning the environmental friendliness of such AI methods at scale. Consequently, energy efficiency plays an important role for AI model developers and infrastructure operators alike. The energy consumption of AI workloads depends on the model implementation and the utilized hardware. Therefore, accurate measurements of the power draw of AI workflows on different types of compute nodes are key to algorithmic improvements and to the design of future compute clusters and hardware. To this end, we present measurements of the energy consumption of two typical applications of deep learning models on different types of compute nodes. Our results indicate that (1) deriving energy consumption directly from runtime is not accurate; instead, the composition of the compute node needs to be taken into account; (2) neglecting accelerator hardware on mixed nodes results in disproportionate energy inefficiency; (3) the energy consumption of model training and inference should be considered separately: while training on GPUs outperforms all other node types in both runtime and energy consumption, inference on CPU nodes can be comparably efficient. One advantage of our approach is that the information on energy consumption is available to all users of the supercomputer, enabling an easy transfer to other workloads while raising user awareness of energy consumption.
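The point that runtime alone is a poor proxy for energy can be made concrete with direct power sampling. The sketch below estimates GPU energy for an arbitrary workload by polling NVML power counters and integrating over time; it assumes an NVIDIA accelerator with the pynvml bindings installed, and it deliberately covers only the GPU, since node-level energy would require additional counters (e.g., CPU RAPL). It is a minimal illustration, not the measurement setup used in the paper.

```python
import time
import threading
import pynvml  # NVIDIA Management Library bindings (assumed available)

def measure_gpu_energy(workload, device_index=0, interval_s=0.1):
    """Sample GPU power draw while `workload()` runs and integrate it to energy.

    Returns (runtime_s, energy_joules). Only the GPU is covered; the node's
    CPUs, memory, and interconnect would need separate counters (e.g. RAPL).
    """
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    samples = []  # (timestamp, watts)
    stop = threading.Event()

    def sampler():
        while not stop.is_set():
            watts = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # mW -> W
            samples.append((time.time(), watts))
            time.sleep(interval_s)

    thread = threading.Thread(target=sampler, daemon=True)
    start = time.time()
    thread.start()
    try:
        workload()  # e.g. one training epoch or a batch of inference requests
    finally:
        stop.set()
        thread.join()
        pynvml.nvmlShutdown()
    runtime = time.time() - start

    # Trapezoidal integration of power over time yields energy in joules.
    energy_j = sum(
        0.5 * (samples[i][1] + samples[i + 1][1]) * (samples[i + 1][0] - samples[i][0])
        for i in range(len(samples) - 1)
    )
    return runtime, energy_j
```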
Magnetic resonance imaging (MRI) is the central imaging modality for stroke. It is used to inform treatment decisions, such as selecting patients for intravenous thrombolysis or endovascular therapy. MRI is subsequently used during hospitalization to predict outcome by visualizing infarct core size and location. Furthermore, it can be used to characterize stroke etiology, e.g., to distinguish between (cardio-)embolic and non-embolic strokes. Computer-based automated medical image processing is increasingly entering clinical routine. Previous iterations of the Ischemic Stroke Lesion Segmentation (ISLES) challenge have helped to generate benchmark methods for the segmentation of acute and sub-acute ischemic stroke lesions. Here we introduce an expert-annotated, multicenter MRI dataset for the segmentation of acute to sub-acute stroke lesions. The dataset comprises 400 multi-vendor MRI cases with high variability in stroke lesion size, quantity, and location. It is split into a training dataset of n=250 and a test dataset of n=150. All training data will be made publicly available. The test dataset will be used for model validation only and will not be released to the public. This dataset forms the basis of the ISLES 2022 challenge, whose goal is to find algorithmic methods enabling the development and benchmarking of robust and accurate segmentation algorithms for ischemic stroke.
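For readers benchmarking on such a dataset, the usual headline metric for lesion segmentation is the Dice similarity coefficient between a predicted mask and the expert annotation. Below is a minimal sketch of that computation; the file names are hypothetical placeholders, and loading assumes NIfTI volumes readable with nibabel.

```python
import numpy as np
import nibabel as nib  # common reader for NIfTI MRI volumes

def dice_score(pred: np.ndarray, truth: np.ndarray) -> float:
    """Dice similarity coefficient between two binary lesion masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    denom = pred.sum() + truth.sum()
    return 1.0 if denom == 0 else 2.0 * intersection / denom

# Hypothetical file names for one case; real cases follow the layout
# defined by the challenge organizers.
truth = nib.load("sub-0001_lesion-mask.nii.gz").get_fdata() > 0.5
pred = nib.load("sub-0001_model-prediction.nii.gz").get_fdata() > 0.5
print(f"Dice = {dice_score(pred, truth):.3f}")
```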
Foundation Models (FMs) are models trained on large corpora of data that, at very large scale, can generalize to new tasks without any task-specific fine-tuning. As these models continue to grow in size, innovations continue to push the boundaries of what they can do on language and image tasks. This paper aims to understand an underexplored area of FMs: classical data tasks such as cleaning and integration. As a proof of concept, we cast five data cleaning and integration tasks as prompting tasks and evaluate the performance of FMs on them. We find that large FMs generalize and achieve SoTA performance on data cleaning and integration tasks even though they are not trained for these data tasks. We identify specific research challenges and opportunities that these models present, including challenges with private and domain-specific data, and opportunities to make data management systems more accessible to non-experts. We make our code and experiments publicly available at: https://github.com/HazyResearch/fm_data_tasks.
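To illustrate what "casting a data task as a prompting task" can look like in practice, here is a minimal, hypothetical sketch of entity matching (a standard data integration task) phrased as a few-shot prompt. The prompt template and the `complete` callback are illustrative assumptions, not the exact templates or API used in the paper.

```python
# Few-shot examples shown to the model before the pair to be matched.
FEW_SHOT = """Product A: {"name": "canon eos 70d dslr", "price": "999"}
Product B: {"name": "canon eos 70d digital slr camera", "price": "995"}
Are Product A and Product B the same? Yes

Product A: {"name": "apple ipad mini 64gb", "price": "529"}
Product B: {"name": "samsung galaxy tab 3", "price": "299"}
Are Product A and Product B the same? No
"""

def match_prompt(record_a: dict, record_b: dict) -> str:
    """Serialize two records into the few-shot matching prompt."""
    return (
        FEW_SHOT
        + f"\nProduct A: {record_a}\nProduct B: {record_b}\n"
        + "Are Product A and Product B the same?"
    )

def entity_match(record_a: dict, record_b: dict, complete) -> bool:
    """`complete` is any text-completion function, e.g. a GPT-3-style API call."""
    answer = complete(match_prompt(record_a, record_b))
    return answer.strip().lower().startswith("yes")
```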
This work details a highly efficient implementation of the 3D scale-invariant feature transform (SIFT) algorithm, for the purpose of machine learning from large sets of volumetric image data. The primary operations of the 3D SIFT code are implemented on a graphics processing unit (GPU), including convolution, sub-sampling, and 4D peak detection from scale-space pyramids. Performance improvements are quantified using 3D MRI human brain volumes of different subjects. Computationally efficient 3D keypoint descriptors are proposed based on the Binary Robust Independent Elementary Features (BRIEF) code, including a novel descriptor we call Ranked Robust Independent Elementary Features (RRIEF), and are compared with the original 3D SIFT-Rank method \citep{toews2013efficient}. The GPU implementation provides a speedup of approximately 7X over an optimized CPU implementation, reducing computation time from roughly 1.4 seconds to 0.2 seconds for a 3D volume of dimension (145,174,145) voxels with approximately 3000 keypoints. Notable speedups include the convolution operation (20X), 4D peak detection (3X), sub-sampling (3X), and Gaussian pyramid construction (2X). Compared with standard SIFT-Rank descriptors, the efficient descriptors offer a 2X speedup and 6X memory savings at the cost of reduced keypoint correspondences, revealing a trade-off between computational efficiency and algorithmic performance. The speedups gained by our implementation will allow for more efficient analysis of larger datasets. Our optimized GPU implementation of the 3D SIFT-Rank extractor is available at https://github.com/carluerjb/3d_sift_cuda.
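The BRIEF-style descriptors mentioned above reduce a keypoint's neighborhood to a bit string of pairwise intensity comparisons, which is what makes them cheap to compute and match. The following numpy sketch shows that idea in 3D; the offsets, patch radius, and descriptor length are assumptions for illustration and do not reproduce the authors' (R)RIEF implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
N_BITS, PATCH_RADIUS = 256, 8
# Fixed random offset pairs within the patch, drawn once and reused for all keypoints.
OFFSETS_A = rng.integers(-PATCH_RADIUS, PATCH_RADIUS + 1, size=(N_BITS, 3))
OFFSETS_B = rng.integers(-PATCH_RADIUS, PATCH_RADIUS + 1, size=(N_BITS, 3))

def brief3d(volume: np.ndarray, keypoint: np.ndarray) -> np.ndarray:
    """Binary descriptor: bit i is 1 where intensity at offset A_i exceeds B_i.

    Assumes `keypoint` (integer voxel coordinates) lies at least PATCH_RADIUS
    voxels away from the volume border.
    """
    a = volume[tuple((keypoint + OFFSETS_A).T)]
    b = volume[tuple((keypoint + OFFSETS_B).T)]
    return (a > b).astype(np.uint8)

def hamming(d1: np.ndarray, d2: np.ndarray) -> int:
    """Matching cost between two binary descriptors."""
    return int(np.count_nonzero(d1 != d2))
```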
Autonomous vehicles and robots require increasing robustness and reliability to meet the demands of modern tasks. These requirements particularly apply to cameras, as they are the primary sensors for acquiring information about the environment and for supporting actions. Cameras must maintain proper functionality and, if necessary, take automatic countermeasures. However, there is hardly any work that examines the practical application of general condition-monitoring approaches for cameras and designs countermeasures with the envisioned high-level applications in mind. We propose a generic and interpretable self-health-maintenance framework for cameras based on data-driven and physically grounded models. To this end, we determine reliable, real-time-capable estimators for the typical image effects of a camera in poor condition (defocus blur, motion blur, different noise phenomena, and their most common combinations) by comparing traditional and machine-learning-based approaches in extensive experiments. Furthermore, we demonstrate how camera parameters (e.g., exposure time and ISO gain) can be adjusted on the basis of experimentally determined (nonlinear and non-monotonic) input-output performance curves to achieve optimal whole-system capability, using object detection, motion blur, and sensor noise as examples. Our framework not only provides a practical, ready-to-use solution for assessing and maintaining camera health, but can also serve as a basis for extensions that tackle more complex problems by empirically combining additional data sources (e.g., sensor or environment parameters) in order to obtain fully reliable and robust machines.
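The parameter-adjustment step can be pictured as a lookup over empirically measured performance curves: because the curves are nonlinear and non-monotonic, a simple search over measured (exposure, gain) operating points is more robust than assuming a closed-form model. The sketch below is hypothetical; all candidate settings and scores are placeholders, not values from the paper.

```python
import numpy as np

exposures_ms = np.array([1, 2, 5, 10, 20])   # candidate exposure times (ms)
iso_gains = np.array([100, 200, 400, 800])   # candidate ISO gains

# detection_map[i, j]: measured object-detection mAP at
# (exposures_ms[i], iso_gains[j]), already reflecting the motion blur and
# sensor noise induced by that setting. Random placeholder values here.
detection_map = np.random.default_rng(1).uniform(0.3, 0.8, size=(5, 4))

# Pick the operating point that maximizes the measured downstream performance.
best_i, best_j = np.unravel_index(np.argmax(detection_map), detection_map.shape)
print(f"set exposure={exposures_ms[best_i]} ms, gain=ISO {iso_gains[best_j]} "
      f"(expected mAP {detection_map[best_i, best_j]:.2f})")
```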
Over the past decades, the machine and deep learning communities have celebrated major achievements on challenging tasks such as image classification. Deep architectures of artificial neural networks, together with the breadth of available data, make it possible to describe highly complex relationships. Yet it is still impossible to fully capture what a deep learning model has learned and to verify that it operates fairly and without bias, particularly in critical tasks such as those arising in the medical domain. One example of such a task is the detection of distinct facial expressions, called action units, in facial images. Given this particular task, our research aims to provide transparency regarding bias, specifically related to gender and skin color. We train a neural network for action unit classification and analyze its performance quantitatively in terms of accuracy and qualitatively by means of heatmaps. A structured review of our results indicates that we are able to detect bias. Although we cannot conclude from our results that lower classification performance stems solely from gender and skin-color bias, these biases must be addressed, which is why we conclude by proposing suggestions on how to avoid the detected biases.
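The kind of bias analysis described above typically starts with disaggregated metrics: computing the classifier's accuracy separately per demographic group and inspecting the spread. A minimal sketch, with placeholder labels and group assignments:

```python
import numpy as np

def group_accuracies(y_true: np.ndarray, y_pred: np.ndarray, groups: np.ndarray) -> dict:
    """Accuracy per group; a large spread is a first indicator of bias."""
    accs = {}
    for g in np.unique(groups):
        mask = groups == g
        accs[str(g)] = float(np.mean(y_true[mask] == y_pred[mask]))
    return accs

# Placeholder action-unit labels, predictions, and skin-tone groups.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])
groups = np.array(["lighter", "lighter", "darker", "darker",
                   "lighter", "darker", "darker", "lighter"])

accs = group_accuracies(y_true, y_pred, groups)
print(accs, "gap:", max(accs.values()) - min(accs.values()))
```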